Mini-Project 03 - Visualizing and Maintaining the Green Canopy of NYC

Introduction

When you’re in a city full of high-rises and densely packed apartments, it’s easy to forget the importance of having trees on every street. They provide many benefits to the city and its residents: they reduce air pollution, cool sidewalks with shade, reduce toxic runoff, and bring extra wildlife into our city. Thankfully, we have official organizations like NYC Parks & Recreation to track and care for trees throughout the city.

In this project, I will be using two datasets: one from NYC Open Data and one from the Department of City Planning. The former will be used to analyze information about trees planted throughout NYC, and the latter to construct maps.

Data Acquisition and Preparation

Packages

As usual, multiple packages will be used for this project. A few extra packages appear here, such as sf, which allows spatial data to be read and spatial operations to be performed.

Show the code
ensure_package <- function(pkg) {
  pkg <- as.character(substitute(pkg))  # allow unquoted package names
  options(repos = c(CRAN = "https://cloud.r-project.org"))

  # Install the package only if it cannot already be loaded
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg)
  }

  # Fail loudly if it still cannot be loaded after installation
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    stop(paste("Package", pkg, "could not be loaded"))
  }
}

ensure_package(httr2)
ensure_package(sf)
ensure_package(dplyr)
ensure_package(glue)
ensure_package(extrafont)
ensure_package(ggplot2)
ensure_package(leaflet)
ensure_package(ggspatial)
ensure_package(patchwork)
ensure_package(ggthemes)
ensure_package(scico)
ensure_package(stringr)
ensure_package(kableExtra)
ensure_package(scales)

City Council Districts

Let’s download the City Council district boundaries from the Department of City Planning website. This dataset contains the boundaries of every district in NYC, which is crucial for the rest of the analysis, as the majority of the exploratory work will be geospatial, essentially map making.

As mentioned before, we will use the sf package, and in particular its st_read function, to read the data correctly. To avoid repeatedly downloading the data, I wrapped the download in a function that checks whether the data already exists locally and only downloads it if it doesn’t. Since the file is hosted as a static zip archive, we can quickly download it into a local directory.

Show the code
#| echo: true
#| include: true
#| output: false
download_nycc_boundaries <- function(
  url = "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/city-council/nycc_25c.zip"
) {
  # Set destination folder and file paths
  dest_dir <- file.path("data", "mp03")
  zip_file <- file.path(dest_dir, basename(url))

  # Create directory if it doesn't exist
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }

  # Download zip file if it doesn't exist
  if (!file.exists(zip_file)) {
    download.file(url, destfile = zip_file, mode = "wb")
  }

  # List contents of the zip to find the .shp file
  zip_contents <- unzip(zip_file, list = TRUE)
  shp_file_rel <- zip_contents$Name[grepl("\\.shp$", zip_contents$Name, ignore.case = TRUE)]

  if (length(shp_file_rel) == 0L) {
    stop("No .shp file found in the zip archive.")
  }

  # Use the first .shp file found
  shp_file_rel <- shp_file_rel[1]
  shp_file <- file.path(dest_dir, shp_file_rel)

  # Unzip only if shapefile doesn't exist
  if (!file.exists(shp_file)) {
    unzip(zip_file, exdir = dest_dir)
  }

  # Read and transform shapefile
  sf_data <- st_read(shp_file, quiet = TRUE)
  sf_data_wgs84 <- st_transform(sf_data, crs = "WGS84")

  return(sf_data_wgs84)
}

  
nyc_boundaries <- download_nycc_boundaries()

NYC Open Data

The next dataset is Forestry Tree Points, fetched through the API provided on the NYC Open Data site. Downloading this dataset the same way as the last one only returns the first 1,000 trees by default, which is a problem because we need all of the data for an accurate analysis. To fix this, I raised the per-request limit from 1,000 to 100,000 rows. The dataset turned out to be much larger still, so I wrote another function whose main purpose is to repeatedly download batches of 100,000 rows and stop once a batch comes back with fewer rows than the limit. Finally, we bind all of the rows together to create one complete dataset.

With 11 files created in this download, there are likely more than one million trees in this city! One thing to note: this dataset does not include trees in parks, so the true number of trees in NYC is likely even higher. The site also has a data dictionary we can reference if we are unsure what a variable means.
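The stopping rule behind the downloader can be illustrated with a toy pager over an in-memory data frame (the 1,050-row table, the limit of 100, and all variable names here are invented for illustration): fetch `limit` rows at a time, and stop once a chunk comes back shorter than the limit.

```r
# Toy illustration of the pagination stopping rule used above.
fake_rows <- data.frame(id = 1:1050)  # pretend the API holds 1,050 rows
limit <- 100
offset <- 0
chunks <- list()

repeat {
  idx <- seq_len(nrow(fake_rows))
  chunk <- fake_rows[idx > offset & idx <= offset + limit, , drop = FALSE]
  if (nrow(chunk) == 0) break           # nothing returned: done
  chunks[[length(chunks) + 1]] <- chunk
  if (nrow(chunk) < limit) break        # short chunk: this was the last page
  offset <- offset + limit
}

length(chunks)                # 11 "requests" in total
nrow(do.call(rbind, chunks))  # all 1,050 rows recovered
```

Ten full batches of 100 plus one short batch of 50 gives 11 chunks, which mirrors the 11 files the real downloader produced.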

Show the code
# Function to download NYC tree points data in chunks and combine into one sf object

download_nyc_tree_points <- function(
  base_url = "https://data.cityofnewyork.us/resource/hn5i-inap.geojson",
  limit = 100000  # Number of rows to fetch per request
) {
  # Directory where downloaded files will be saved
  dest_dir <- file.path("data", "mp03")

  # Initialize offset for pagination (start at 0)
  offset <- 0

  # List to store all downloaded sf objects
  all_files <- list()

  # Repeat until no more data is returned
  repeat {
    # Convert offset to string (avoid scientific notation)
    offset_str <- format(offset, scientific = FALSE)

    # Create file name for this chunk based on offset
    file_name <- glue("tree_points_{offset_str}.geojson")
    file_path <- file.path(dest_dir, file_name)

    # If file does not exist locally, download it
    if (!file.exists(file_path)) {
      # Build the paginated request and perform it with httr2
      resp <- request(base_url) |>
        req_url_query(`$limit` = limit, `$offset` = offset_str) |>
        req_user_agent("Baruch College Mini-Project for Student, API access") |>
        req_perform()

      # Save the raw response body to file
      writeBin(resp_body_raw(resp), file_path)
    }

    # Try reading the GeoJSON file into an sf object
    sf_data <- tryCatch({
      st_read(file_path, quiet = TRUE) # Read spatial data quietly
    }, error = function(e) {
      # If reading fails, print message and return NULL
      message(glue("Failed to read file at offset {offset_str}: {e$message}"))
      return(NULL)
    })

    # If no data returned or read failed, break the loop
    if (is.null(sf_data) || nrow(sf_data) == 0) {
      break
    }

    # Normalize planteddate column to character (avoid issues with mixed types)
    if ("planteddate" %in% names(sf_data)) {
      sf_data$planteddate <- as.character(sf_data$planteddate)
    }
    # Append this chunk to the list
    all_files[[length(all_files) + 1]] <- sf_data
    # If fewer rows than limit were returned, stop
    if (nrow(sf_data) < limit) {
      break
    }

    # Otherwise, increase offset and continue
    offset <- offset + limit
  }

  # Combine all chunks into one data frame
  combined_data <- bind_rows(all_files)

  # Return the combined sf object
  return(combined_data)
}



tree_points <- download_nyc_tree_points()

Exploratory Analysis

Mapping NYC Trees

As mentioned before, there are a lot of trees in NYC, likely more than a million. A simple map plotting every tree would not work well: the dots would overlap and form one big green blob, and changing the alpha or point size wouldn’t help much. So I used leaflet instead of ggplot for this map. Here, you can zoom in and click a point to see the species and overall condition of the tree.

Show the code
# Sample 10,000 trees so the interactive map stays responsive
tree_points_samp <- tree_points |>
  slice_sample(n = 10000)


# Clean species names and prepare color mapping
tree_points_map <- tree_points_samp  |>
  mutate(
    genusspecies = str_to_title(str_extract(genusspecies, "[^-]+$")),
    color = case_when(
      tpcondition == "Excellent" ~ "#006400",  
      tpcondition == "Good" ~ "#2A9E00",       
      tpcondition == "Fair" ~ "#FFD700",       
      tpcondition == "Poor" ~ "#FF8C00",       
      tpcondition == "Critical" ~ "#FF4500",   
      tpcondition == "Dead" ~ "#8B0000",       
      TRUE ~ "#A9A9A9"                         
    )
  )

# Build interactive map
leaflet(tree_points_map) |>
  addProviderTiles("CartoDB.Positron") |>
  addPolygons(
    data = nyc_boundaries,
    color = "black",
    weight = 1,
    fillColor = "grey90"
  ) |>
  addCircleMarkers(
    radius = 3,
    color = ~color,  # Use dynamic color
    stroke = FALSE,
    fillOpacity = 0.6,
    popup = ~paste(
      "<b>Species:</b>", genusspecies,
      "<br><b>Condition:</b>", tpcondition
    )
  )

District-level Analysis of Trees

Which council district has the most trees?

To conduct further analysis on NYC trees, I joined the sampled Tree Points onto the district boundaries using st_join with st_contains, which performs a spatial join. However, grouped summaries become very slow when the geometry column is carried along. A quick fix is st_drop_geometry: drop the geometry for the computation, then either join it back with a regular join later, or reference the geometry-bearing object directly in the plot.
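To make the join logic concrete, here is a minimal toy example (the unit square and the three points are invented for illustration, and sf is assumed to be loaded): st_join with join = st_contains returns one row of the polygon for each point it contains.

```r
library(sf)

# Toy spatial join: one square polygon and three points, two of which fall inside
square <- st_sf(
  id = 1,
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)
  ))))
)
pts <- st_sf(
  tree = c("a", "b", "c"),
  geometry = st_sfc(st_point(c(1, 1)), st_point(c(1.5, 0.5)), st_point(c(3, 3)))
)

joined_toy <- st_join(square, pts, join = st_contains)
nrow(joined_toy)  # 2: one row per contained point ("c" lies outside the square)
```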

Show the code
joined <- st_join(nyc_boundaries, tree_points_samp, join = st_contains)


highlight <- joined |>
  group_by(CounDist) |>
  st_drop_geometry() |> #removing geometry for quick computations
  summarise(num_trees = n()) |>
  slice_max(num_trees, n = 1)

# Plot
joined |>
  group_by(CounDist) |>
  mutate(num_trees = n()) |>
  slice(1) |>  # keep one polygon per district
  ungroup() |>
  ggplot() +
  geom_sf(aes(fill = num_trees), color = "white") +
  # highlighting district with highest num of trees
  geom_sf(data = filter(joined, CounDist == highlight$CounDist),
        fill = NA, color = "red", size = 1.2)+
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  theme_void() +
  labs(
    title = "Number of Trees per District",
    caption = glue("District {highlight$CounDist} (highlighted in red) has the most trees ({highlight$num_trees})"),
    fill = "Number of Trees"
  )

Which council district has the highest density of trees?

This plot is similar to the previous one, except that it shows tree density instead of the raw number of trees. Districts vary in size, and larger districts will tend to have more trees simply because of that; District 51 on Staten Island, for example, covers a large area.

Having more trees does not automatically mean a district is doing well: it could simply be proportionally larger than other districts while having the same number of trees per block. Measuring tree density instead tells us whether there are more trees relative to the district’s size.
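The conversion constant in the code below follows from Shape_Area being reported in square feet. A quick unit check, using a hypothetical 2 mi² district with 5,000 trees (numbers made up for illustration):

```r
# One mile is 5,280 feet, so one square mile is 5280^2 square feet
ft2_per_mi2 <- 5280^2
ft2_per_mi2  # 27878400, the divisor used in the code below

# Hypothetical district: 5,000 trees in 55,756,800 ft^2 (exactly 2 mi^2)
round(5000 / (55756800 / ft2_per_mi2), 1)  # 2500 trees per square mile
```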

Show the code
#Find the densities of each area
tree_density <- joined |>
  st_drop_geometry() |> #removing geometry for quick computations
  group_by(CounDist) |>
  summarise(
    n = n(),
    area = first(Shape_Area), # one area value per district (summarise needs a single value)
    .groups = "drop"
  ) |>
  mutate(
    trees_per_mi = round(n / (area/ 27878400), 1)  # ft² -> mi²
  )

#Join back to geometry; keep all districts
tree_density <- nyc_boundaries |>
  select(CounDist, geometry, Shape_Area) |>
  left_join(tree_density, by = "CounDist")

# Compute the district with the max density 
max_density<- tree_density |>
  st_drop_geometry() |>
  filter(trees_per_mi == max(trees_per_mi)) |>
  slice(1)


# Plot 
ggplot(tree_density) +
  geom_sf(aes(fill = trees_per_mi), color = "white", linewidth = 0.2) +
    geom_sf(data = filter(joined, CounDist == max_density$CounDist),
        fill = NA, color = "red", size = 1.2)+
  scale_fill_gradient(
    low  = "lightgreen",
    high = "darkgreen",
    name = expression("Trees per mi"^2)
  ) +
  labs(
    title   = "Tree Density per District",
    caption = paste0("District ", max_density$CounDist, " has the highest tree density, with ",
                     max_density$trees_per_mi,
                     " Trees per mi²")
    # no `fill =` here—legend title already set in the scale
  ) +
  theme_void() +
  theme(
    plot.title   = element_text(face = "bold", hjust = 0.5),
    legend.position = "right"
  )

Which council district has the highest ratio of dead trees?

Before computing the ratio, let’s check the distinct values of tpcondition, the variable recording each tree’s condition:

Show the code
joined |>
  st_drop_geometry() |>
  select(tpcondition)|>
  distinct(tpcondition)
     tpcondition
1           Fair
1.1         Good
1.5    Excellent
1.8      Unknown
1.15        Poor
1.60        Dead
2.30    Critical
Show the code
dead_ratio<- joined|>
  st_drop_geometry()|> #removing geometry for quick computations
  group_by(CounDist)|>
  summarise(
    total = n(),
    n_dead = sum(tpcondition == "Dead"),
    dead_frac = n_dead/total,
    .groups = "drop"
  )

#Joining back on Geometry
dead_ratio <- nyc_boundaries |>
  select(CounDist, geometry, Shape_Area) |>
  left_join(dead_ratio , by = "CounDist")

max_dead<- dead_ratio |>
  st_drop_geometry() |>
  filter(dead_frac == max(dead_frac)) |>
  slice(1)



dead_ratio|>
  ggplot()+
  geom_sf(aes(fill = dead_frac), color = "white", linewidth = 0.2)+
  scale_fill_gradientn(
  colours = c( "#FFFFCC","darkgreen", "#4D004B"), # light yellow → green → dark purple
  values = scales::rescale(c(0, 0.5, 1)),
  name = "Dead Tree Ratio"
)+
  theme_void()+
  labs(
    title   = "Fraction of Dead Trees per district",
    caption = paste0(
    "District ", max_dead$CounDist,
    " has the highest ratio of dead trees (", round(max_dead$dead_frac, 2), ")."
))+
  theme(
    plot.title   = element_text(face = "bold", hjust = 0.5),
    legend.position = "right"
  )

What are the most common tree species in Manhattan?

To look at species by borough, I first mapped each council district to its borough using the district-number ranges, then tallied the five most common species within Manhattan:

Show the code
tree_borough<- joined |>
  mutate(Borough=
  case_when(
    CounDist >= 1 & CounDist <= 10 ~ "Manhattan",
    CounDist >= 11 & CounDist <= 18 ~ "Bronx",
    CounDist >= 19 & CounDist <= 32 ~ "Queens",
    CounDist >= 33 & CounDist <= 48 ~ "Brooklyn",
    CounDist >= 49 & CounDist <= 51 ~ "Staten Island"
  )
  )

tree_borough|>
  st_drop_geometry()|>
  filter(Borough== "Manhattan")|>
  group_by(genusspecies)|>
  summarize(number_of_spec=
              n())|>
  arrange(number_of_spec)|>
  ungroup()|>
  mutate(
    genusspecies= str_to_title(str_extract(genusspecies, "[^-]+$")),
    Percentage = percent(number_of_spec / sum(number_of_spec), accuracy= .01)
  )|>
  slice_max(order_by= number_of_spec, n=5)|>
  mutate(number_of_spec= comma(number_of_spec))|>
  rename(`Species`= genusspecies, `Amount of Trees`=number_of_spec)|>
  kbl(align = c("l","c"), caption = "Top 5 Common Tree Species in Manhattan")|>
  kable_minimal(c("hover", "condensed", "responsive"), full_width = F, font_size=15)|>
  column_spec (1, color= "darkgreen")|>
  column_spec(2, bold = TRUE)|>
  column_spec(3,color= "gray40")
Top 5 Common Tree Species in Manhattan
Species Amount of Trees Percentage
Thornless Honeylocust 175 14.23%
London Planetree 111 9.02%
Callery Pear 78 6.34%
Pin Oak 77 6.26%
Maidenhair Tree 70 5.69%
Which trees are closest to Baruch College?

Finally, I projected both the sampled tree points and a point for Baruch College into EPSG:2263 (NY State Plane, in feet) so that distances are accurate, then picked the five nearest trees:

Show the code
# Baruch College point (lon, lat)
baruch_point <- st_sfc(st_point(c(-73.9836, 40.7402)), crs = 4326)

# Transform both to projected CRS for accurate distance in feet
tree_points_proj <- st_transform(tree_points_samp, 2263)   # NYC State Plane (feet)
baruch_proj <- st_transform(baruch_point, 2263)

# Compute distances and get top 5 closest trees
tree_points_proj |>
  mutate(distance = as.numeric(st_distance(geometry, baruch_proj))) |>  # convert to numeric
  arrange(distance) |>
  slice_head(n = 5) |>
  mutate(
    Species = str_to_title(str_extract(genusspecies, "[^-]+$")),
    distance = round(distance, 1)  # clean numeric, no units
  ) |>
  select(Species, Condition = tpcondition, `Distance (in ft)`= distance)|>
  st_drop_geometry()|>
  kbl(align = c("l", "c", "c"),
      caption = "Top 5 Closest Trees to Baruch College") |>
  kable_minimal(c("hover", "condensed", "responsive"),
                full_width = FALSE, font_size = 15) |>
  column_spec(1, color = "darkgreen") |>
  column_spec(3, bold = TRUE)
Top 5 Closest Trees to Baruch College
Species Condition Distance (in ft)
Chestnut Oak Good 611.1
Unknown Dead 613.9
Maidenhair Tree Fair 709.0
Caucasian Linden Excellent 904.5
Maidenhair Tree Fair 1007.4
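As a quick sanity check that EPSG:2263 yields distances in feet, here is a toy pair of points (coordinates invented for illustration, with sf assumed loaded) placed 100 projected units apart:

```r
library(sf)

# Two points 100 units apart on the EPSG:2263 plane (NY State Plane, US feet)
p1 <- st_sfc(st_point(c(1000000, 200000)), crs = 2263)
p2 <- st_sfc(st_point(c(1000100, 200000)), crs = 2263)
as.numeric(st_distance(p1, p2))  # 100, in the CRS's native unit (US survey feet)
```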




This work ©2025 by tylerml71 was initially prepared as a Mini-Project for STA 9750 at Baruch College. More details about this course can be found at the course site and instructions for this assignment can be found at MP #03